A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages
نویسندگان
چکیده
With the popularity of Internet technologies and applications, inappropriate or illegal online messages have become a problem for the society. The goal of authorship attribution for anonymous online messages is to identify the authorship from a group of potential suspects for investigation identification. Most previous contributions focused on extracting various writing-style features and employing machine learning algorithms to identify the author. However, as far as Chinese online messages are concerned, they contain not only Chinese characters but also English characters, special symbols, emoticons, slang, etc. It is challenging for word segmentation techniques to segment Chinese online messages correctly. Moreover, online messages are usually short. The performance for short samples would be decreased greatly using traditional machine learning algorithms. In this paper, a profile-based authorship attribution approach for Chinese online messages is firstly provided. N-gram techniques are employed to extract frequency sequences, and the category frequency feature selection method is used to filter common frequent sequences. The profile-based method is used to represent the suspects as category profiles. The illegal messages are attributed to the most likely authorship by comparing the similarity between unknown illegal online messages and suspects’ profiles. Experiments on BBS, Blog, and E-mail datasets show that the proposed profile-based authorship attribution approach can identify the authors effectively. Compared with two instance-based benchmark methods, the proposed profile-based method can obtain better authorship attribution results.
منابع مشابه
Comparative study of Authorship Identification Techniques for Cyber Forensics Analysis
Authorship Identification techniques are used to identify the most appropriate author from group of potential suspects of online messages and find evidences to support the conclusion. Cybercriminals make misuse of online communication for sending blackmail or a spam email and then attempt to hide their true identities to void detection.Authorship Identification of online messages is the contemp...
متن کاملAuthorship attribution of SMS messages using an N-grams approach
The pervasive use of SMS is increasing the amount of digital evidence available on cellular phones. Consequently it has become important to detect SMS authors, as a post-hoc analysis technique deemed useful in criminal persecution cases. This paper investigates an N-grams based approach for determining the authorship of SMS messages. Despite the scarcity of words in SMS messages and the differe...
متن کاملVote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification
The Author Identification task for PAN 2012 consisted of three different sub-tasks: traditional authorship attribution, authorship clustering and sexual predator identification. We developed three machine learning approaches for these tasks. For the two authorship related tasks we created various sets of feature spaces, where individual differences in writing styles are assumed to surface in ju...
متن کاملProtecting Critical Infrastructure and Key Assets
The Internet is a critical infrastructure and asset in the information age. However, cyber criminals have been using various web-based channels (e.g., email, web sites, Internet newsgroups, and Internet chat rooms) to distribute illegal materials. One common characteristic of these channels is anonymity. Compared with conventional crimes, cybercrime conducted through such anonymous channels imp...
متن کاملA framework for authorship identification of online messages: Writing-style features and classification techniques
With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. We developed a framework for authorship identification of online messages to address the identity-tracing problem. In this fra...
متن کامل